In [1]:
pip install --upgrade wordcloud pillow
Requirement already up-to-date: wordcloud in c:\users\admin\anaconda3\lib\site-packages (1.9.3)
Requirement already up-to-date: pillow in c:\users\admin\anaconda3\lib\site-packages (10.3.0)
Requirement already satisfied, skipping upgrade: matplotlib in c:\users\admin\anaconda3\lib\site-packages (from wordcloud) (3.2.2)
Requirement already satisfied, skipping upgrade: numpy>=1.6.1 in c:\users\admin\anaconda3\lib\site-packages (from wordcloud) (1.18.5)
Requirement already satisfied, skipping upgrade: kiwisolver>=1.0.1 in c:\users\admin\anaconda3\lib\site-packages (from matplotlib->wordcloud) (1.2.0)
Requirement already satisfied, skipping upgrade: python-dateutil>=2.1 in c:\users\admin\anaconda3\lib\site-packages (from matplotlib->wordcloud) (2.8.1)
Requirement already satisfied, skipping upgrade: cycler>=0.10 in c:\users\admin\anaconda3\lib\site-packages (from matplotlib->wordcloud) (0.10.0)
Requirement already satisfied, skipping upgrade: pyparsing!=2.0.4,!=2.1.2,!=2.1.6,>=2.0.1 in c:\users\admin\anaconda3\lib\site-packages (from matplotlib->wordcloud) (2.4.7)
Requirement already satisfied, skipping upgrade: six>=1.5 in c:\users\admin\anaconda3\lib\site-packages (from python-dateutil>=2.1->matplotlib->wordcloud) (1.15.0)
Note: you may need to restart the kernel to use updated packages.
In [2]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import plotly.express as px
from wordcloud import WordCloud, STOPWORDS, ImageColorGenerator
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveRegressor
import warnings
warnings.filterwarnings('ignore')
%matplotlib inline

import plotly.offline as pyo
data = pd.read_csv("Instagram data.csv", encoding = 'latin1')

print(data.head())
   Impressions  From Home  From Hashtags  From Explore  From Other  Saves  \
0         3920       2586           1028           619          56     98   
1         5394       2727           1838          1174          78    194   
2         4021       2085           1188             0         533     41   
3         4528       2700            621           932          73    172   
4         2518       1704            255           279          37     96   

   Comments  Shares  Likes  Profile Visits  Follows  \
0         9       5    162              35        2   
1         7      14    224              48       10   
2        11       1    131              62       12   
3        10       7    213              23        8   
4         5       4    123               8        0   

                                             Caption  \
0  Here are some of the most important data visua...   
1  Here are some of the best data science project...   
2  Learn how to train a machine learning model an...   
3  Here’s how you can write a Python program to d...   
4  Plotting annotations while visualizing your da...   

                                            Hashtags  
0  #finance #money #business #investing #investme...  
1  #healthcare #health #covid #data #datascience ...  
2  #data #datascience #dataanalysis #dataanalytic...  
3  #python #pythonprogramming #pythonprojects #py...  
4  #datavisualization #datascience #data #dataana...  
In [3]:
#Before starting everything, let’s have a look at whether this dataset contains any null values or not:
data.isnull().sum()
Out[3]:
Impressions       0
From Home         0
From Hashtags     0
From Explore      0
From Other        0
Saves             0
Comments          0
Shares            0
Likes             0
Profile Visits    0
Follows           0
Caption           0
Hashtags          0
dtype: int64
In [4]:
#The dataset has no null values in any column. Calling dropna() below is only a safeguard and leaves the data unchanged:
data = data.dropna()
In [5]:
#Let’s have a look at the insights of the columns to understand the data type of all the columns:
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 119 entries, 0 to 118
Data columns (total 13 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   Impressions     119 non-null    int64 
 1   From Home       119 non-null    int64 
 2   From Hashtags   119 non-null    int64 
 3   From Explore    119 non-null    int64 
 4   From Other      119 non-null    int64 
 5   Saves           119 non-null    int64 
 6   Comments        119 non-null    int64 
 7   Shares          119 non-null    int64 
 8   Likes           119 non-null    int64 
 9   Profile Visits  119 non-null    int64 
 10  Follows         119 non-null    int64 
 11  Caption         119 non-null    object
 12  Hashtags        119 non-null    object
dtypes: int64(11), object(2)
memory usage: 13.0+ KB
In [6]:
#Analyzing Instagram Reach
plt.figure(figsize=(10, 8))
plt.style.use('fivethirtyeight')
plt.title("Distribution of Impressions From Home")
sns.distplot(data['From Home'])
plt.show()
In [7]:
#The impressions I get from the home section on Instagram show how well my posts reach my followers. Looking at the impressions
#from home, I can say it’s hard to reach all my followers daily. Now let’s have a look at the distribution of the impressions
#I received from hashtags:
plt.figure(figsize=(10, 8))
plt.title("Distribution of Impressions From Hashtags")
sns.distplot(data['From Hashtags'])
plt.show()
In [8]:
#Hashtags are tools we use to categorize our posts on Instagram so that we can reach more people based on the kind of content we
#are creating. The hashtag impressions show that not every post gets much reach from hashtags, but many new users can be
#reached through them. Now let’s have a look at the distribution of impressions I have received from the explore section of Instagram:

plt.figure(figsize=(10, 8))
plt.title("Distribution of Impressions From Explore")
sns.distplot(data['From Explore'])
plt.show()
In [9]:
#The explore section of Instagram is the recommendation system of Instagram. It recommends posts to the users based on their 
#preferences and interests. By looking at the impressions I have received from the explore section, I can say that Instagram 
#does not recommend our posts much to the users. Some posts have received a good reach from the explore section, but it’s still
#very low compared to the reach I receive from hashtags.

#Now let’s have a look at the percentage of impressions I get from various sources on Instagram:
home = data["From Home"].sum()
hashtags = data["From Hashtags"].sum()
explore = data["From Explore"].sum()
other = data["From Other"].sum()

labels = ['From Home','From Hashtags','From Explore','Other']
values = [home, hashtags, explore, other]

#values and labels are plain lists, so no DataFrame needs to be passed to px.pie
fig = px.pie(values=values, names=labels,
             title='Impressions on Instagram Posts From Various Sources', hole=0.5)

pyo.iplot(fig)
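The percentages a donut plot displays are each source's total divided by the grand total. A minimal sketch of that arithmetic follows; the function name `source_shares` and the example totals are hypothetical and are not taken from the dataset.

```python
def source_shares(totals):
    """Return each source's share of all impressions, in percent."""
    grand_total = sum(totals.values())
    if grand_total == 0:
        raise ValueError("total impressions must be positive")
    return {name: 100 * count / grand_total for name, count in totals.items()}


# Hypothetical totals, chosen only to make the arithmetic easy to follow
example_totals = {
    "From Home": 500,
    "From Hashtags": 300,
    "From Explore": 150,
    "Other": 50,
}

shares = source_shares(example_totals)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

With the real data, the same function could be called on a dictionary built from `home`, `hashtags`, `explore` and `other`.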
In [10]:
#So the above donut plot shows that almost 50 per cent of the reach is from my followers, 38.1 per cent is from hashtags, 
#9.14 per cent is from the explore section, and 3.01 per cent is from other sources.

#Analyzing Content
#Now let’s analyze the content of my Instagram posts. The dataset has two columns, namely Caption and
#Hashtags, which will help us understand the kind of content I post on Instagram.

#Let’s create a wordcloud of the caption column to look at the most used words in the caption of my Instagram posts:
text = " ".join(i for i in data.Caption)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
plt.style.use('classic')
plt.figure( figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
In [11]:
#Now let’s create a wordcloud of the hashtags column to look at the most used hashtags in my Instagram posts:
text = " ".join(i for i in data.Hashtags)
stopwords = set(STOPWORDS)
wordcloud = WordCloud(stopwords=stopwords, background_color="white").generate(text)
plt.figure( figsize=(12,10))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis("off")
plt.show()
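A word cloud shows relative frequency only through font size. If exact counts are wanted, the hashtags can also be tallied directly. The sketch below assumes hashtags are separated by whitespace, as in the rows printed by `data.head()`; the helper `top_hashtags` and the sample strings are made up for illustration.

```python
from collections import Counter


def top_hashtags(hashtag_strings, n=3):
    """Count hashtags across posts and return the n most frequent."""
    counts = Counter()
    for entry in hashtag_strings:
        # Split on whitespace, keep only tokens that look like hashtags,
        # and lowercase them so "#Python" and "#python" count together
        counts.update(tag.lower() for tag in entry.split() if tag.startswith("#"))
    return counts.most_common(n)


# Made-up sample, not rows from the dataset
sample = ["#data #python #datascience", "#python #ai", "#Python #data"]
top = top_hashtags(sample, n=2)
print(top)
```

On the real data this could be called as `top_hashtags(data["Hashtags"], n=10)`.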
In [12]:
#Analyzing Relationships
#Now let’s analyze relationships to find the most important factors of our Instagram reach. It will
#also help us in understanding how the Instagram algorithm works.

#Let’s have a look at the relationship between the number of likes and the number of impressions on my Instagram posts:
figure = px.scatter(data_frame = data, x="Impressions",
                    y="Likes", size="Likes", trendline="ols",
                    title = "Relationship Between Likes and Impressions")

pyo.iplot(figure)
In [13]:
#There is a linear relationship between the number of likes and the reach I got on Instagram. Now let’s see the relationship
#between the number of comments and the number of impressions on my Instagram posts:
figure = px.scatter(data_frame = data, x="Impressions",
                    y="Comments", size="Comments", trendline="ols",
                    title = "Relationship Between Comments and Total Impressions")

pyo.iplot(figure)
In [14]:
#It looks like the number of comments we get on a post doesn’t affect its reach. Now let’s have a look at the relationship 
#between the number of shares and the number of impressions:
figure = px.scatter(data_frame = data, x="Impressions",
                    y="Shares", size="Shares", trendline="ols",
                    title = "Relationship Between Shares and Total Impressions")

pyo.iplot(figure)
In [15]:
#More shares result in a higher reach, but shares don’t affect the reach of a post as much as likes do.
#Now let’s have a look at the relationship between the number of saves and the number of impressions:
figure = px.scatter(data_frame = data, x="Impressions",
                    y="Saves", size="Saves", trendline="ols",
                    title = "Relationship Between Post Saves and Total Impressions")
pyo.iplot(figure)
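The scatter plots suggest how strongly each metric moves with impressions, and a correlation coefficient puts a number on that. The sketch below computes Pearson's r in plain Python, using only the five rows printed by `data.head()`, so the values describe that small sample and not the full dataset. The function name `pearson_r` is an illustrative choice.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# First five rows of the dataset, as printed by data.head()
impressions = [3920, 5394, 4021, 4528, 2518]
likes = [162, 224, 131, 213, 123]
comments = [9, 7, 11, 10, 5]

r_likes = pearson_r(impressions, likes)
r_comments = pearson_r(impressions, comments)
print("likes vs impressions:", round(r_likes, 3))
print("comments vs impressions:", round(r_comments, 3))
```

With pandas, the same figure for the whole dataset should be available through `data["Likes"].corr(data["Impressions"])`.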
In [16]:
#Analyzing Conversion Rate
#On Instagram, the conversion rate is the share of profile visits from a post that turn into new followers.
#The formula for the conversion rate is (Follows / Profile Visits) * 100. Now let’s have a look at the conversion rate of my Instagram account:

conversion_rate = (data["Follows"].sum() / data["Profile Visits"].sum()) * 100
print(conversion_rate)
41.00265604249668
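The formula (Follows / Profile Visits) * 100 can be wrapped in a small helper that also handles the case of zero profile visits. This is a sketch: the function name is illustrative, and the inputs are only the five rows shown by `data.head()`, so the result differs from the account-wide figure above.

```python
def conversion_rate(follows, profile_visits):
    """Follows as a percentage of profile visits, summed over all posts."""
    total_visits = sum(profile_visits)
    if total_visits == 0:
        # No visits means no conversions can be measured
        return 0.0
    return 100 * sum(follows) / total_visits


# First five rows of the dataset, as printed by data.head()
follows = [2, 10, 12, 8, 0]
profile_visits = [35, 48, 62, 23, 8]

rate = conversion_rate(follows, profile_visits)
print(f"{rate:.2f}%")  # 32 follows out of 176 visits
```

Summing before dividing weights every visit equally. Averaging the per-post rates would instead give small posts the same weight as large ones, so the two numbers usually differ.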
In [17]:
#So the conversion rate of my Instagram account is 41%, which sounds like a very good conversion rate. Let’s have a look at
#the relationship between the total profile visits and the number of followers gained from all profile visits:
figure = px.scatter(data_frame = data, x="Profile Visits",
                    y="Follows", size="Follows", trendline="ols",
                    title = "Relationship Between Profile Visits and Followers Gained")

pyo.iplot(figure)
In [18]:
#The relationship between profile visits and followers gained is also linear.

#Instagram Reach Prediction Model
#In this section, I will train a machine learning model to predict the reach of an
#Instagram post. Let’s split the data into training and test sets before training the model:
x = np.array(data[['Likes', 'Saves', 'Comments', 'Shares',
                   'Profile Visits', 'Follows']])
y = np.array(data["Impressions"])
xtrain, xtest, ytrain, ytest = train_test_split(x, y,
                                                test_size=0.2,
                                                random_state=42)
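To see what a 20% hold-out means for 119 rows, here is a simplified shuffle-and-split in plain Python. It is an illustration of the idea, not scikit-learn's implementation, and it will not reproduce the same partition as `train_test_split`. Rounding the test count up is an assumption made here for the sketch.

```python
import math
import random


def simple_split(rows, test_size=0.2, seed=42):
    """Shuffle row indices with a fixed seed and hold out a test fraction."""
    indices = list(range(len(rows)))
    random.Random(seed).shuffle(indices)
    # Assumption for this sketch: round the test count up
    n_test = math.ceil(test_size * len(rows))
    test_idx = indices[:n_test]
    train_idx = indices[n_test:]
    return [rows[i] for i in train_idx], [rows[i] for i in test_idx]


# Stand-in for the 119 posts in the dataset
rows = list(range(119))
train, test = simple_split(rows)
print(len(train), len(test))
```

A fixed seed plays the same role as `random_state=42`: the shuffle, and therefore the split, is the same on every run.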
In [19]:
#Here is how we can train a machine learning model to predict the reach of an Instagram post using Python:
model = PassiveAggressiveRegressor()
model.fit(xtrain, ytrain)
model.score(xtest, ytest)
Out[19]:
0.6092574725869004
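For scikit-learn regressors, `score` returns the coefficient of determination R², so the value above means the model explains roughly 61% of the variance in impressions on the test set. A minimal sketch of how R² is computed follows; the function name and the numbers are invented for illustration.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when y_true is constant")
    return 1 - ss_res / ss_tot


# Invented values: a perfect prediction and a predict-the-mean baseline
actual = [1.0, 2.0, 3.0]
perfect = r_squared(actual, [1.0, 2.0, 3.0])
baseline = r_squared(actual, [2.0, 2.0, 2.0])
print(perfect, baseline)
```

A score of 1 is a perfect fit, 0 is no better than always predicting the mean, and negative values are possible when the model does worse than that baseline.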
In [20]:
#Now let’s predict the reach of an Instagram post by giving inputs to the machine learning model:
# Features = [['Likes','Saves', 'Comments', 'Shares', 'Profile Visits', 'Follows']]
features = np.array([[282.0, 233.0, 4.0, 9.0, 165.0, 54.0]])
model.predict(features)
Out[20]:
array([8554.45342029])
In [21]:
#So this is how you can analyze and predict the reach of Instagram posts with machine learning using Python. If a content
#creator wants to do well on Instagram in the long run, they have to look at the data of their Instagram reach. That is where
#the use of Data Science in social media comes in. I hope you liked this article on the task of Instagram Reach Analysis using
#Python. Feel free to ask questions in the comments section below.